K Modes Clustering Algorithm Based on a New Distance Measure
Author
Abstract
The leading partitional clustering technique, K Modes, is one of the most computationally efficient clustering methods for categorical data. In the traditional K Modes algorithm, the simple matching dissimilarity measure is used to compute the distance between two values of the same categorical attribute. This measure compares the two categorical values directly and yields a difference of zero when the two values are identical and one otherwise. However, the similarity between categorical values is not considered. In this paper, a new distance measure based on rough set theory is proposed, which overcomes this shortcoming of the simple matching dissimilarity measure and is used along with the traditional K Modes clustering algorithm. When computing the distance between two values of the same categorical attribute, the new distance measure takes into account not only their difference but also the discernibility of other relational categorical attributes with respect to them. The time complexity of the modified K Modes clustering algorithm is linear with respect to the number of data objects, so it can be applied to large data sets. The performance of the K Modes algorithm with the new distance measure is tested on real-world data sets. Comparisons with the K Modes algorithm based on many different distance measures illustrate the effectiveness of the new distance measure.
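As a point of reference for the baseline the abstract describes, the following is a minimal sketch of traditional K Modes with the simple matching dissimilarity (0 for identical values, 1 otherwise, summed over attributes). The function names, the toy data, and the choice of initial modes are illustrative assumptions, not taken from the paper. The rough-set-based measure is not reproduced here because the abstract does not give its definition.

```python
from collections import Counter


def simple_matching(a, b):
    """Simple matching dissimilarity: count of attributes whose values differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent value of each attribute (ties go to the first value seen)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, init_modes, max_iter=20):
    """Baseline K Modes sketch: assign to nearest mode, then recompute modes.

    `init_modes` is supplied by the caller (an assumption of this sketch);
    real implementations usually pick initial modes by some heuristic.
    """
    modes = [tuple(m) for m in init_modes]
    labels = []
    for _ in range(max_iter):
        # Assignment step: one pass over the data, linear in the number of objects.
        labels = [
            min(range(len(modes)), key=lambda j: simple_matching(row, modes[j]))
            for row in data
        ]
        # Update step: the new mode of each cluster is its per-attribute mode.
        new_modes = []
        for j, old in enumerate(modes):
            members = [row for row, lab in zip(data, labels) if lab == j]
            new_modes.append(column_modes(members) if members else old)
        if new_modes == modes:  # converged: modes no longer change
            break
        modes = new_modes
    return labels, modes


if __name__ == "__main__":
    # Hypothetical toy data set with three categorical attributes.
    data = [
        ("red", "small", "round"),
        ("red", "small", "square"),
        ("blue", "large", "round"),
        ("blue", "large", "square"),
    ]
    labels, modes = k_modes(data, init_modes=[data[0], data[2]])
    print(labels, modes)
```

A different distance measure could be tried in this sketch by swapping out `simple_matching`, since the assignment step is the only place where the distance is used.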
Similar resources
An Optimization K-Modes Clustering Algorithm with Elephant Herding Optimization Algorithm for Crime Clustering
The detection and prevention of crime, in the past few decades, required several years of research and analysis. However, today, thanks to smart systems based on data mining techniques, it is possible to detect and prevent crime in considerably less time. Classification and clustering-based smart techniques can classify and cluster the crime-related samples. The most important factor in the c...
Weighted Ensemble Clustering for Increasing the Accuracy of the Final Clustering
Clustering algorithms are highly dependent on different factors such as the number of clusters, the specific clustering algorithm, and the distance measure used. Inspired by ensemble classification, one approach to reduce the effect of these factors on the final clustering is ensemble clustering. Since weighting the base classifiers has been a successful idea in ensemble classification, in th...
A partition-based algorithm for clustering large-scale software systems
Clustering techniques are used to extract the structure of software for understanding, maintaining, and refactoring. In the literature, most of the proposed approaches for software clustering are divided into hierarchical algorithms and search-based techniques. In the former, clustering is a process of merging (splitting) similar (non-similar) clusters. These techniques suffered from the drawba...
A Hybrid Time Series Clustering Method Based on Fuzzy C-Means Algorithm: An Agreement Based Clustering Approach
In recent years, the advancement of information gathering technologies such as GPS and GSM networks has led to huge complex datasets such as time series and trajectories. As a result, it is essential to use appropriate methods to analyze the large raw datasets produced. Extracting useful information from large data sets has always been one of the most important challenges in different sciences,...
Genetic Distance Measure for K-modes Algorithm
K-means algorithm has been shown to be an effective and efficient algorithm for clustering. However, the k-means algorithm is developed for numerical data only. It is not suitable for the clustering of non-numerical data. K-modes algorithm has been developed for clustering categorical objects by extending from the k-means algorithm. However, no one applies this technique for classification of c...
Journal title:
Volume, issue
Pages -
Publication date 2010